These materials have been taken from the Software Carpentry: R Novice Lesson. You can find the original materials here


This lesson will cover some basic functions that can be used to manipulate data in R. Again, we will be using the gapminder data set, which includes country information on GDP, population, etc.


There are five main functions we’ll be talking about today, each allowing us to manipulate data frames. These five functions are:


If you haven’t already, make sure you have dplyr() and gapminder() installed and loaded with the following commands:

# Download the packages
# install.packages(c("dplyr", "gapminder"))

# Load the packages for use
library(dplyr)
library(gapminder)


Let’s take a quick look at our data frame to remind ourselves of its structure. We do this using the head() command, which will display the first 10 rows (given by n = 10) of our data frame.

head(gapminder, n = 10)
# A tibble: 10 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.


Choose Columns: select

The first function we’ll be using is select(). This function let’s us pick columns from our data frame, based on name (e.g. year) or by index (e.g. 3).

Let’s try using select() to pick out a few columns: “country”, “year”, “lifeExp”, and “pop”. We’ll be assigning these columns to a new data frame, gapminder_select. Then we’ll use head() to see if it worked.

# select() code here
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Check the data frame
head(gapminder_select, n = 10)
# A tibble: 10 x 4
   country      year lifeExp      pop
   <fct>       <int>   <dbl>    <int>
 1 Afghanistan  1952    28.8  8425333
 2 Afghanistan  1957    30.3  9240934
 3 Afghanistan  1962    32.0 10267083
 4 Afghanistan  1967    34.0 11537966
 5 Afghanistan  1972    36.1 13079460
 6 Afghanistan  1977    38.4 14880372
 7 Afghanistan  1982    39.9 12881816
 8 Afghanistan  1987    40.8 13867957
 9 Afghanistan  1992    41.7 16317921
10 Afghanistan  1997    41.8 22227415


As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the select() command.


Here we’ll also introduce another great feature of dplyr(): the pipe ( %>% ). This symbol sends or pipes an object (e.g. a data frame like gapminder) INTO a function (e.g. select()).

So, the above select() command can be rewritten as follows (NOTE: the “.” is a placeholder, which represents the object being piped). Again, we can check our result using head().

# select() using pipe syntax
gapminder_pipe <- gapminder %>% select(., country, year, lifeExp, pop)

head(gapminder_pipe, n = 10)
# A tibble: 10 x 4
   country      year lifeExp      pop
   <fct>       <int>   <dbl>    <int>
 1 Afghanistan  1952    28.8  8425333
 2 Afghanistan  1957    30.3  9240934
 3 Afghanistan  1962    32.0 10267083
 4 Afghanistan  1967    34.0 11537966
 5 Afghanistan  1972    36.1 13079460
 6 Afghanistan  1977    38.4 14880372
 7 Afghanistan  1982    39.9 12881816
 8 Afghanistan  1987    40.8 13867957
 9 Afghanistan  1992    41.7 16317921
10 Afghanistan  1997    41.8 22227415


We can actually simplify the above command further - dplyr’s functions such as select() are smart enough that you don’t actually need to include the “.” placeholder, as shown below.

# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% select(country, year, lifeExp, pop)

head(gapminder_pipe2, n = 10)
# A tibble: 10 x 4
   country      year lifeExp      pop
   <fct>       <int>   <dbl>    <int>
 1 Afghanistan  1952    28.8  8425333
 2 Afghanistan  1957    30.3  9240934
 3 Afghanistan  1962    32.0 10267083
 4 Afghanistan  1967    34.0 11537966
 5 Afghanistan  1972    36.1 13079460
 6 Afghanistan  1977    38.4 14880372
 7 Afghanistan  1982    39.9 12881816
 8 Afghanistan  1987    40.8 13867957
 9 Afghanistan  1992    41.7 16317921
10 Afghanistan  1997    41.8 22227415


Challenge 1

Using the select() command and pipe (%>%) notation, pick the following columns from the gapminder data frame, assign them to a new variable (we’ll use x), and display the results using head(x, n = 10). Columns to choose are:

  • continent
  • GDP per capita
  • life expectancy
  • year
# Answer here:
x <- select()


Choose Rows: filter

So we’ve covered selecting columns, but what about rows? This is where filter() comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30.

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (==).

Let’s go through a couple examples.

# Filter rows where country is Canada
gapminder_canada <- gapminder %>% filter(country == "Canada")

head(gapminder_canada, n = 10)
# A tibble: 10 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Canada  Americas   1952    68.8 14785584    11367.
 2 Canada  Americas   1957    70.0 17010154    12490.
 3 Canada  Americas   1962    71.3 18985849    13462.
 4 Canada  Americas   1967    72.1 20819767    16077.
 5 Canada  Americas   1972    72.9 22284500    18971.
 6 Canada  Americas   1977    74.2 23796400    22091.
 7 Canada  Americas   1982    75.8 25201900    22899.
 8 Canada  Americas   1987    76.9 26549700    26627.
 9 Canada  Americas   1992    78.0 28523502    26343.
10 Canada  Americas   1997    78.6 30305843    28955.


Let’s try another one, this time filtering on life expectancy above a certain threshold:

# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% filter(lifeExp > 50)

head(gapminder_LE, n = 10)
# A tibble: 10 x 6
   country continent  year lifeExp     pop gdpPercap
   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
 1 Albania Europe     1952    55.2 1282697     1601.
 2 Albania Europe     1957    59.3 1476505     1942.
 3 Albania Europe     1962    64.8 1728137     2313.
 4 Albania Europe     1967    66.2 1984060     2760.
 5 Albania Europe     1972    67.7 2263554     3313.
 6 Albania Europe     1977    68.9 2509048     3533.
 7 Albania Europe     1982    70.4 2780097     3631.
 8 Albania Europe     1987    72   3075321     3739.
 9 Albania Europe     1992    71.6 3326498     2497.
10 Albania Europe     1997    73.0 3428038     3193.


We can also filter with multiple arguments, each separated by a comma:

# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% filter(country == "Canada", lifeExp > 80)

head(gapminder_C_LE, n = 10)
# A tibble: 1 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Canada  Americas   2007    80.7 33390141    36319.


Challenge 2

Use filter() to choose data for African countries, from the year 1980 and onwards.

# Challenge 2 code here
x <- filter()


Create New Columns: mutate()

Let’s say we now want to calculate the GDP in billions, which is done by mutiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). mutate() will perform this calculation on each row in the data frame, one row at a time (i.e. row-wise). The code below will calculate the GDP in billions:

# Use mutate() to calculate GDP in billions
gapminder_gdpBil <- gapminder %>% mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_gdpBil, n = 10)
# A tibble: 10 x 7
   country     continent  year lifeExp      pop gdpPercap gdp_billion
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.        6.57
 2 Afghanistan Asia       1957    30.3  9240934      821.        7.59
 3 Afghanistan Asia       1962    32.0 10267083      853.        8.76
 4 Afghanistan Asia       1967    34.0 11537966      836.        9.65
 5 Afghanistan Asia       1972    36.1 13079460      740.        9.68
 6 Afghanistan Asia       1977    38.4 14880372      786.       11.7 
 7 Afghanistan Asia       1982    39.9 12881816      978.       12.6 
 8 Afghanistan Asia       1987    40.8 13867957      852.       11.8 
 9 Afghanistan Asia       1992    41.7 16317921      649.       10.6 
10 Afghanistan Asia       1997    41.8 22227415      635.       14.1 


Combine Functions with Pipes

We’ve seen that pipes ( %>% ) can be used to send an object such as a data frame into a function, such as select(), or filter(). But they can also be used to send the output of one function into another function. This allows us to chain together multiple commmands, without the need for intermediate variables.

Let’s take a look at this in an example.

# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada")

head(gapminder_multi, n = 10)
# A tibble: 10 x 5
   country  year lifeExp      pop gdpPercap
   <fct>   <int>   <dbl>    <int>     <dbl>
 1 Canada   1952    68.8 14785584    11367.
 2 Canada   1957    70.0 17010154    12490.
 3 Canada   1962    71.3 18985849    13462.
 4 Canada   1967    72.1 20819767    16077.
 5 Canada   1972    72.9 22284500    18971.
 6 Canada   1977    74.2 23796400    22091.
 7 Canada   1982    75.8 25201900    22899.
 8 Canada   1987    76.9 26549700    26627.
 9 Canada   1992    78.0 28523502    26343.
10 Canada   1997    78.6 30305843    28955.


We can further expand on this by incorporating our mutate() command from earlier, linking multiple functions into a single command. Be sure to indent (TAB key) when moving to a new line after a pipe.

# select() the four columns, filter() for Canada, and calculate GDP in billions
gapminder_multi_2 <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada") %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_multi_2, n = 10)
# A tibble: 10 x 6
   country  year lifeExp      pop gdpPercap gdp_billion
   <fct>   <int>   <dbl>    <int>     <dbl>       <dbl>
 1 Canada   1952    68.8 14785584    11367.        168.
 2 Canada   1957    70.0 17010154    12490.        212.
 3 Canada   1962    71.3 18985849    13462.        256.
 4 Canada   1967    72.1 20819767    16077.        335.
 5 Canada   1972    72.9 22284500    18971.        423.
 6 Canada   1977    74.2 23796400    22091.        526.
 7 Canada   1982    75.8 25201900    22899.        577.
 8 Canada   1987    76.9 26549700    26627.        707.
 9 Canada   1992    78.0 28523502    26343.        751.
10 Canada   1997    78.6 30305843    28955.        878.


Calculations with group_by and summarise

These functions allow us to work on our data in specific groups. For example, we can use group_by() to group observations by country, then calculate the average life expectancy for each country.

# group_by() country, calculate average life expectancy
gapminder_grp <- gapminder %>% 
group_by(country) %>% 
summarise(mean(lifeExp))

head(gapminder_grp, n = 10)
# A tibble: 10 x 2
   country     `mean(lifeExp)`
   <fct>                 <dbl>
 1 Afghanistan            37.5
 2 Albania                68.4
 3 Algeria                59.0
 4 Angola                 37.9
 5 Argentina              69.1
 6 Australia              74.7
 7 Austria                73.1
 8 Bahrain                65.6
 9 Bangladesh             49.8
10 Belgium                73.6


Let’s do another example, again grouping by country. This time, we’ll calculate the mean and standard deviation of the GDP per capita. We’ll also specify the column names inside of the summarise() command.

gapminder_mean_sd <- gapminder %>% 
    group_by(country) %>% 
    summarise(mean_gdp = mean(gdpPercap), sd_gdp = sd(gdpPercap))

head(gapminder_mean_sd, n = 10)
# A tibble: 10 x 3
   country     mean_gdp sd_gdp
   <fct>          <dbl>  <dbl>
 1 Afghanistan     803.   108.
 2 Albania        3255.  1192.
 3 Algeria        4426.  1310.
 4 Angola         3607.  1166.
 5 Argentina      8956.  1863.
 6 Australia     19981.  7815.
 7 Austria       20412.  9655.
 8 Bahrain       18078.  5415.
 9 Bangladesh      818.   235.
10 Belgium       19901.  8391.


Combining tables with left_join()

Let’s say you have a data frame you’ve generated, containing a list of gene IDs and some corresponding value, such as expression. And you have another table that contains many (e.g. all) human gene IDs, as well as the names for those genes. For example the gene ID “ENSG00000012048” corresponds to the gene BRCA1. Now let’s say you want to map between the two tables using the gene ID, to create a single table with ID-Expression-Name for each gene. That’s what left_join() is for!

This function takes one data frame “x” and using a specified column, looks for matching entries in “y”. Note that the output data frame will contain all rows and columns from “x”, as well as all columns from “y”, but only matching rows from “y”.

First let’s make some simple example data to play with:

df_1 <- data.frame(FruitID = c("f1", "f2", "f3", "f4", "f5"),
                   Fruit_Exp = c(1, -1, 0, 4, -8))

df_2 <- data.frame(FruitID = c("f1", "f2", "f3", "f4", "f5", "f6", "f7"),
                   FruitName = c("apple", "orange", "banana", "pear", "blueberry", "melon", "peach"))

head(df_1)
  FruitID Fruit_Exp
1      f1         1
2      f2        -1
3      f3         0
4      f4         4
5      f5        -8
head(df_2)
  FruitID FruitName
1      f1     apple
2      f2    orange
3      f3    banana
4      f4      pear
5      f5 blueberry
6      f6     melon

Now we can use left_join() to combine the two tables, based on matching values in a specified column. The syntax is as follow:

left_join(df_1, df_2, by = "FruitID")
  FruitID Fruit_Exp FruitName
1      f1         1     apple
2      f2        -1    orange
3      f3         0    banana
4      f4         4      pear
5      f5        -8 blueberry


Extra: Note that you can have different column names in each of your data frames, and still join the tables together. The syntax for this is:

left_join(x, y, by = c("columnX" = "columnY"))


Tying it all together

Now let’s use all the commands we’ve covered and combine them with pipes into a single statement.

Let’s say we want calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 and onwards. We can accomplish this all in one step as follows.

# select() columns, filter() by year, calculate GDP in billions, mean() and sd() of GDP in billions
gapminder_final <- gapminder %>% 
    select(country, year, pop, gdpPercap) %>% 
    filter(year >= 1980) %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>% 
    group_by(country) %>% 
    summarise(mean_gdpBillion = mean(gdp_billion), sd_gdpBillion = sd(gdp_billion))

head(gapminder_final, n = 10)
# A tibble: 10 x 3
   country     mean_gdpBillion sd_gdpBillion
   <fct>                 <dbl>         <dbl>
 1 Afghanistan            16.4          7.66
 2 Albania                13.1          4.84
 3 Algeria               149.          33.2 
 4 Angola                 28.9         15.5 
 5 Argentina             353.          91.5 
 6 Australia             478.         154.  
 7 Austria               225.          50.3 
 8 Bahrain                12.4          5.15
 9 Bangladesh            120.          54.3 
10 Belgium               272.          54.6